Abstract. We propose a new automaton model, called quantified data 
automata over words, that can model quantified invariants over linear 
data structures, and build poly-time active learning algorithms for them, 
where the learner is allowed to query the teacher with membership and 
equivalence queries. In order to express invariants in decidable logics, we 
invent a decidable subclass of QDAs, called elastic QDAs, and prove that 
every QDA has a unique minimally-over-approximating elastic QDA. We 
then give an application of these theoretically sound and efficient active 
learning algorithms in a passive learning framework and show that we can 
efficiently learn quantified linear data structure invariants from samples 
obtained from dynamic runs for a large class of programs. 

1 Introduction 

Synthesizing invariants for programs is one of the most challenging problems in 
verification today. In this paper, we are interested in using learning techniques 
to synthesize quantified data-structure invariants. 

In an active black-box learning framework, we look upon the invariant as a 
set of configurations of the program, and allow the learner to query the teacher 
for membership and equivalence queries on this set. Furthermore, we fix a par- 
ticular representation class for these sets, and demand that the learner learn the 
smallest (simplest) representation that describes the set. A learning algorithm 
that learns in time polynomial in the size of the simplest representation of the 
set is desirable. In passive black-box learning, the learner is given a sample of 
examples and counter-examples of configurations, and is asked to synthesize the 
simplest representation that includes the examples and excludes the counter- 
examples. In general, several active learning algorithms that work in polynomial 
time are known (e.g., learning regular languages represented as DFAs [1]) while 
passive polynomial-time learning is rare (e.g., conjunctive Boolean formulas can 
be learned but general Boolean formulas cannot be learned efficiently, automata 
cannot be learned passively efficiently) [2]. 

In this paper, we build active learning algorithms for quantified logical for- 
mulas describing sets of linear data-structures. Our aim is to build algorithms 
that can learn formulas of the form $\forall y_1, \ldots, y_k.\ \varphi$, where $\varphi$ is quantifier-free, 
and that capture properties of arrays and lists (the variables range over indices 
for arrays, and locations for lists, and the formula can refer to the data stored at 
these positions and compare them using arithmetic, etc.). Furthermore, we show 
that we can build learning algorithms that learn properties that are expressible 
in known decidable logics. We then employ the active learning algorithm in a 
passive learning setting where we show that by building an imprecise teacher 
that answers the questions of the active learner, we can build effective invariant 
generation algorithms that learn simply from a finite set of examples. 
Active Learning of Quantified Properties using QDAs: Our first tech- 
nical contribution is a novel representation (normal form) for quantified prop- 
erties of linear data-structures, called quantified data automata (QDA), and a 
polynomial-time active learning algorithm for QDAs. 

We model linear data-structures as data words, where each position is deco- 
rated with a finite alphabet modeling the program's pointer variables that point 
to that cell in the list or index variables that index into the cell of the array, and 
with data modeling the data value stored in the cell, e.g. integers. 

Quantified data automata (QDA) are a new model of automata over data 
words that are powerful enough to express universally quantified properties of 
data words. A QDA accepts a data word provided it accepts all possible 
annotations of the data word with valuations of a (fixed) set of variables 
$Y = \{y_1, \ldots, y_k\}$; for each such annotation, the QDA reads the data word, records 
the data stored at the positions pointed to by Y, and finally checks these data 
values against a data formula determined by the final state reached. QDAs are 
very powerful in expressing typical invariants of programs manipulating lists 
and arrays, including invariants of a wide variety of searching and sorting al- 
gorithms, maintenance of lists and arrays using insertions/deletions, in-place 
manipulations that destructively update lists, etc. 

We develop an efficient learning algorithm for QDAs. By using a combination 
of abstraction over a set of data formulas and Angluin's learning algorithm for 
DFAs [1], we build a learning algorithm for QDAs. We first show that for any 
set of valuation words (data words with valuations for the variables Y), there is 
a canonical QDA. Using this result, we show that learning valuation words can 
be reduced to learning formula words (words with no data but paired with data 
formulas), which in turn can be achieved using Angluin-style learning of Moore 
machines. The number of queries the learner poses and the time it takes are 
bounded polynomially in the size of the canonical QDA that is learned. Intuitively, given 
a set of pointers into linear data structures, there are an exponential number of 
ways to permute the pointers into these and the universally quantified variables; 
the learning algorithm allows us to search this space using only polynomial time 
in terms of the actual permutations that figure in the set of data words learned. 
Elastic QDAs and a Unique Minimal Over-Approximation Theorem: 
The quantified properties that we learn in this paper (we can synthesize them 
from QDAs) are very powerful, and are, in general, undecidable. Consequently, 
even if they are learnt in an invariant-learning application, we will be unable to 
verify automatically whether the learnt properties are adequate invariants for 
the program at hand. The goal of this paper is to also offer mechanisms to learn 
invariants that are amenable to decision procedures. 

The second technical contribution of this paper is to identify a subclass of 
QDAs (called elastic QDAs) and show two main results for them: (a) elastic 
QDAs can be converted to decidable logical formulas, to the array property 
fragment when modeling arrays and the decidable Strand fragment when modeling 
lists; (b) a surprising unique minimal over-approximation theorem that says that 
for every QDA, accepting say a language L of valuation words, there is a unique 
minimal (with respect to inclusion) language of valuation words L' containing L 
that is accepted by an elastic QDA. 

The latter result allows us to learn QDAs and then apply the unique minimal 
over-approximation (which is effective) to compute the best over-approximation 
of it that can be expressed by elastic QDAs (which then is decidable). The 
result is proved by showing that there is a unique way to minimally morph a 
QDA to one that satisfies the elasticity restrictions. For the former, we identify 
a common property of the array property fragment and the syntactic decidable 
fragment of Strand, called elasticity (following the general terminology in the 
literature on Strand [3]). Intuitively, both the array property fragment and 
Strand prohibit testing whether quantified cells are a bounded distance apart (the 
array property fragment does this by disallowing arithmetic expressions over the 
quantified index variables [4], and the decidable fragment of Strand does 
this by permitting only $\rightarrow^*$ or $\rightarrow^+$ to compare quantified 
variables [3,5]). We finally identify a structural restriction of QDAs that permits 
only elastic properties to be stated. 
Passive Learning of Quantified Properties: The active learning algorithm 
can itself be used in a verification framework, where the membership and equiv- 
alence queries are answered using under-approximate and deductive techniques 
(for instance, for iteratively increasing values of k, a teacher can answer 
membership questions based on bounded and reverse-bounded model-checking, and 
answer equivalence queries by checking if the invariant is adequate using a con- 
straint solver; see Appendix D for details). In this paper, we do not pursue an 
implementation of active learning as above, but instead build a passive learning 
algorithm that uses the active learning algorithm. 

Our motivation for doing passive learning is that we believe (and we val- 
idate this belief using experiments) that in many problems, a lighter-weight 
passive-learning algorithm which learns from a few randomly-chosen small data- 
structures is sufficient to divine the invariant. Note that passive learning algo- 
rithms, in general, often boil down to a guess-and-check algorithm of some kind, 
and often pay an exponential price in the property learned. Designing a passive 
learning algorithm using an active learning core allows us to build more 
interesting algorithms; in our algorithm, the inaccuracies and guessing are confined to the 
way the teacher answers the learner's questions. 

The passive learning algorithm works as follows. Assume that we have a finite 
set of configurations S, obtained from sampling the program (by perhaps just 
running the program on various random small inputs). We are required to learn 
the simplest representation that captures the set S (in the form of a QDA). 
We now use an active learning algorithm for QDAs; membership questions are 
answered with respect to the set S (note that this is imprecise, as an invariant I 
must include S but need not be precisely S). When asked an equivalence query 
with a set I, we check whether $S \subseteq I$; if yes, we can check if the invariant is 
adequate using a constraint-solver and the program. 
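
The imprecise teacher described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: configurations are hashable values, a conjecture is given as a predicate, the names SampleTeacher, member and equivalent are hypothetical, and the final adequacy check with a constraint solver is left out.

```python
class SampleTeacher:
    """Answers queries with respect to a finite sample S of configurations."""

    def __init__(self, samples):
        self.samples = set(samples)

    def member(self, config):
        # Imprecise on purpose: anything outside S is answered "no",
        # which keeps the learned invariant semantically small.
        return config in self.samples

    def equivalent(self, conjecture):
        # Check S ⊆ I; return a sample missing from the conjecture, if any.
        for s in self.samples:
            if not conjecture(s):
                return s
        return None  # S ⊆ I holds (adequacy would be checked separately)


# Hypothetical usage: configurations are pairs (x, y) observed at a loop head.
teacher = SampleTeacher({(1, 2), (1, 1), (0, 3)})
```

A conjecture such as `lambda c: c[0] <= c[1]` passes the inclusion check here, while `lambda c: c[0] < c[1]` is refuted by the sample (1, 1).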

It turns out that this is a good way to build a passive learning algorithm. 
First, enumerating random small data-structures that get manifest at the header 
of a loop fixes for the most part the structure of the invariant, since the invariant 
is forced to be expressed as a QDA. Second, our active learning algorithm for 
QDAs promises never to ask long membership queries (queried words are 
guaranteed to be shorter than the diameter of the automaton), and often the teacher has 
the correct answers. Finally, note that the passive learning algorithm answers 
membership queries with respect to S; this is because we do not know the true 
invariant, and hence err on the side of keeping the invariant semantically small. 
This inaccuracy is common in most learning algorithms employed for verification 
(e.g., Boolean learning [6], compositional verification [7,8], etc.). This inaccuracy 
could lead to a non-optimal QDA being learnt, and is precisely why our algo- 
rithm need not work in time polynomial in the simplest representation of the 
concept (though it is polynomial in the invariant it finally learns). 

The proof of the efficacy of the passive learning algorithm rests in the exper- 
imental evaluation. We implement the passive learning algorithm (which in turn 
uses the active learning algorithm). By building a teacher using dynamic test 
runs of the program and by pitting this teacher against the learner, we learn 
invariant QDAs, and then over-approximate them using EQDAs. These EQ- 
DAs are then transformed into formulas over decidable theories of arrays and 
lists. Using a wide variety of programs manipulating arrays and lists, ranging 
from several examples in the literature involving sorting algorithms, partitioning, 
merging lists, reversing lists, and programs from the Glib list library, programs 
from the Linux kernel, a device driver, and programs from a verified-for-security 
mobile application platform, we show that we can effectively learn adequate 
quantified invariants in these settings. In fact, since our technique is a black-box 
technique, we show that it can be used to infer pre-conditions/post-conditions 
for methods as well. 

Related Work: For invariants expressing properties on the dynamic heap, 
shape analysis techniques are the most well known [9], where locations are 
classified/merged using unary predicates (some dictated by the program and some 
given as instrumentation predicates by the user), and abstractions summarize 
all nodes with the same predicates into a single node. The data automata that 
we build also express an infinite set of linear data structures, but do so using 
automata, and further allow n-ary quantified relations between data elements. 
In recent work, [10] describes an abstract domain, for analyzing list-manipulating 
programs, that can capture quantified properties about the structure and 
the data stored in lists. This domain can be instantiated with any numerical 
domain for the data constraints and a set of user-provided patterns for captur- 
ing the structural constraints. However, providing these patterns for quantified 
invariants is in general a difficult task. 



In recent years, techniques based on Craig's interpolation [11] have emerged 
as a new method for invariant synthesis. Interpolation techniques, which are 
inherently white-box as well, are known for several theories, including linear 
arithmetic, uninterpreted function theories, and even quantified properties over 
arrays and lists [12,13,14,15]. These methods use different heuristics like term 
abstraction [2], preferring smaller constants [12,13] and use of existential ghost 
variables [15] to ensure that the interpolant converges on an invariant from a 
finite set of spurious counter-examples. IC3 [16] is another white-box technique 
for generalizing inductive invariants from a set of counter-examples. 

A primary difference in our work, compared to all the work above, is that 
ours is a black-box technique that does not look at the code of the program, but 
synthesizes an invariant from a snapshot of examples and counter-examples that 
characterize the invariant. The black-box approach to constructing invariants has 
both advantages and disadvantages. The main disadvantage is that information 
regarding what the program actually does is lost in invariant synthesis. However, 
this is the basis for its advantage as well — by not looking at the code, the 
learning algorithm promises to learn the sets with the simplest representations 
in polynomial time, and can also be much more flexible. For instance, even when 
the code of the program is complex, for example having non-linear arithmetic or 
complex heap manipulations that preclude logical reasoning, black-box learning 
gives ways to learn simple invariants for them. 

There are several black-box learning algorithms that have been explored in 
verification. Boolean formula learning has been investigated for finding 
quantifier-free program invariants [17], and also extended to quantified invariants [6]. 
However, unlike us, [6] learns a quantified formula given a set of data predicates as 
well as the predicates that can appear in the guards of the quantified formula. 
Recently, machine learning techniques have also been explored [18]. Variants of 
the Houdini algorithm [19] essentially use conjunctive Boolean learning (which 
can be achieved in polynomial time) to learn conjunctive invariants over 
templates of atomic formulas (see also [20]). The most mature work in this area is 
Daikon [21], which learns formulas over a template, by enumerating all formulas 
and checking which ones satisfy the samples, and where scalability is achieved 
in practice using several heuristics that reduce the enumeration space which 
is doubly-exponential. For quantified invariants over data-structures, however, 
such heuristics aren't very effective, and Daikon often restricts learning only to 
formulas of very restricted syntax, like formulas with a single atomic guard, etc. 
In our experiments Daikon was, for instance, not able to learn the loop invariant 
for the selection sort algorithm. 

2 Overview 

List and Array Invariants: Consider a typical invariant in a sorting program 
over lists where the loop invariant is expressed as: 

$$head \rightarrow^* i \;\wedge\; \forall y_1, y_2.\ \big((head \rightarrow^* y_1 \wedge succ(y_1, y_2) \wedge y_2 \rightarrow^* i) \Rightarrow d(y_1) \le d(y_2)\big) \qquad (1)$$
This says that for all cells $y_1$ that occur somewhere in the list pointed to by head 
and where $y_2$ is the successor of $y_1$, and where $y_1$ and $y_2$ are before the cell 
pointed to by a scalar pointer variable i, the data value stored at $y_1$ is no larger 
than the data value stored at $y_2$. This formula is not in the decidable fragment 
of Strand since the universally quantified variables are involved in a non-elastic 
relation succ (in the subformula $succ(y_1, y_2)$). Such an invariant for a program 
manipulating arrays can be expressed as: 

$$\forall y_1, y_2.\ \big((0 \le y_1 \wedge y_2 = y_1 + 1 \wedge y_2 \le i) \Rightarrow A[y_1] \le A[y_2]\big) \qquad (2)$$
Note that the above formula is also not in the decidable array property fragment. 

Quantified Data Automata: The key idea in this paper is an automaton 
model for expressing such constraints called quantified data automata (QDA). 
The above two invariants are expressed by the following QDA: 
The above automaton reads (deterministically) data words whose labels 
denote the positions pointed to by the scalar pointer variables head and i, as well as 
valuations of the quantified variables $y_1$ and $y_2$. We use two blank symbols that 
indicate that no scalar variable ("b") or no variable from Y ("$-$") is read in the 
corresponding component; moreover, $b = (b, -)$. Missing transitions go to a sink 
state labeled false. The above automaton accepts a data word w with a valuation 
v for the universally quantified variables $y_1$ and $y_2$ as follows: it stores the value 
of the data at $y_1$ and $y_2$ in two registers, and then checks whether the formula 
annotating the final state it reaches holds for these data values. The automaton 
accepts the data word w if for all possible valuations of $y_1$ and $y_2$, the 
automaton accepts the corresponding word with valuation. The above automaton 
hence accepts precisely the set of data words that satisfy the invariant formula. 

Decidable Fragments and Elastic Quantified Data Automata: The empti- 
ness problem for QDAs is undecidable; in other words, the logical formulas that 
QDAs express fall into undecidable theories of lists and arrays. A common restric- 
tion in the array property fragment as well as the syntactic decidable fragments 
of Strand is that quantification is not permitted to be over elements that are 
only a bounded distance away. The restriction allows quantified variables to only 
be related through elastic relations (following the terminology of Strand [3,5]). 

For instance, a formula equivalent to the formula in Eq. 1 but expressed in 
the decidable fragment of Strand over lists is: 

$$head \rightarrow^* i \;\wedge\; \forall y_1, y_2.\ \big((head \rightarrow^* y_1 \wedge y_1 \rightarrow^* y_2 \wedge y_2 \rightarrow^* i) \Rightarrow d(y_1) \le d(y_2)\big) \qquad (3)$$
This formula compares data at $y_1$ and $y_2$ whenever $y_2$ occurs sometime after 
$y_1$, and this makes the formula fall in a decidable class. Similarly, a formula 
equivalent to the formula in Eq. 2 in the decidable array property fragment is: 



$$\forall y_1, y_2.\ \big((0 \le y_1 \wedge y_1 \le y_2 \wedge y_2 \le i) \Rightarrow A[y_1] \le A[y_2]\big) \qquad (4)$$
The above two formulas are captured by a QDA that is the same as in the figure 
above, except that the b-transition leaving $q_2$ is replaced by a b-loop on $q_2$. 
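
As a sanity check of the claim that the elastic array formula is equivalent to the successor-based one, the following Python sketch evaluates both on all small arrays. The function names and the brute-force enumeration are illustrative assumptions and not part of the technique of this paper; the relations are read as non-strict ($\le$) throughout.

```python
from itertools import product

def holds_eq2(A, i):
    """Eq. (2): every pair of adjacent cells up to index i is ordered."""
    n = len(A)
    return all(A[y1] <= A[y2]
               for y1 in range(n) for y2 in range(n)
               if 0 <= y1 and y2 == y1 + 1 and y2 <= i)

def holds_eq4(A, i):
    """Eq. (4): every pair of cells y1 <= y2 up to index i is ordered."""
    n = len(A)
    return all(A[y1] <= A[y2]
               for y1 in range(n) for y2 in range(n)
               if 0 <= y1 and y1 <= y2 and y2 <= i)

def agree_on_small_arrays(max_len=4, values=(0, 1, 2)):
    """Compare both formulas on all arrays over `values` up to max_len."""
    for n in range(1, max_len + 1):
        for A in product(values, repeat=n):
            for i in range(n):
                if holds_eq2(A, i) != holds_eq4(A, i):
                    return False
    return True
```

Agreement on small arrays is of course no proof; the equivalence itself follows from the transitivity of $\le$.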

We identify a restricted form of quantified data automata, called elastic 
quantified data automata (EQDA), in Section 5, which structurally captures the 
constraint that quantified variables can be related only using elastic relations (like 
$\rightarrow^*$ and $\le$). Furthermore, we show in Section 5 that EQDAs can be converted to 
formulas in the decidable fragment of Strand and the array property fragment, 
and hence express invariants that are amenable to decidable analysis across 
loop bodies. 

It is important to note that QDAs are not necessarily a blown-up version of 
the formulas they correspond to. For a formula, the corresponding QDA can be 
exponential, but for a QDA the corresponding formula can be exponential as well 
(QDAs are like BDDs, where there is sharing of common suffixes of constraints, 
which is absent in a formula). 

3 Quantified Data Automata 

We model lists (and finite sets of lists) and arrays that contain data over some 
data domain D as finite words, called data words, encoding the pointer variables 
and the data values. Consider a finite set of pointer variables $PV = \{p_1, \ldots, p_r\}$ 
and let $\Sigma = 2^{PV}$. The empty set corresponds to a blank symbol indicating that 
no pointer variable occurs at this position. We also denote this blank symbol by 
the letter b. A data word over PV and the data domain D is an element w of 
$(\Sigma \times D)^*$, such that every $p \in PV$ occurs exactly once in the word (i.e., for each 
$p \in PV$, there is precisely one j such that $w[j] = (X, d)$, with $p \in X$ and $d \in D$). 

Let us fix a set of variables Y. The automata we build accept a data word if for 
all possible valuations of Y over the positions of the data word, the data stored at 
these positions satisfy certain properties. For this purpose, the automaton reads 
data words extended by valuations of the variables in Y, called valuation words. 
The variables are then quantified universally in the semantics of the automaton 
model (as explained later in this section). 

A valuation word is a word $v \in (\Sigma \times D \times (Y \cup \{-\}))^*$, where v projected to the 
first and second components forms a data word and where each $y \in Y$ occurs in 
the third component of a letter precisely once in the word. The symbol '$-$' is used 
for the positions at which no variable from Y occurs. A valuation word hence 
defines a data word along with a valuation of Y. The data word corresponding 
to such a word v is the word in $(\Sigma \times D)^*$ obtained by projecting it to its first and 
second components. Note that the choice of the alphabet enforces the variables 
from Y to be in different positions. 
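
To make the two definitions concrete, here is a small Python sketch that checks the data-word condition and enumerates all valuation words extending a data word. The encoding (frozensets for elements of $\Sigma$, the string "-" for the blank, and the function names) is an assumption made for illustration only.

```python
from itertools import permutations

BLANK = "-"  # third component when no variable from Y sits at a position

def is_data_word(word, PV):
    """word is a list of (X, d) pairs; each p in PV must occur exactly once."""
    return all(sum(p in X for X, _ in word) == 1 for p in PV)

def valuation_words(word, Y):
    """Yield every extension of a data word by a placement of Y.

    Each variable is placed at exactly one position and, because a letter
    carries at most one variable, distinct variables get distinct positions.
    """
    for positions in permutations(range(len(word)), len(Y)):
        where = dict(zip(positions, Y))
        yield [(X, d, where.get(j, BLANK)) for j, (X, d) in enumerate(word)]


# Hypothetical three-cell list: head points to cell 0, i points to cell 2.
example = [(frozenset({"head"}), 3), (frozenset(), 5), (frozenset({"i"}), 8)]
extensions = list(valuation_words(example, ("y1", "y2")))
```

A data word of length n thus has n!/(n-k)! valuation words for k quantified variables, all of which must be accepted for the data word to be accepted.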

To express the properties on the data, we fix a set of constants, functions 
and relations over D. We assume that the quantifier-free first-order theory over 
this domain is decidable. We encourage the reader to keep in mind the theory 
of integers with constants (0, 1, etc.), addition, and the usual relations ($<$, $\le$, 
etc.) as a standard example of such a domain. 



Quantified data automata use a finite set F of formulas over the atoms 
$d(y_1), \ldots, d(y_n)$ that is additionally equipped with a (semi-)lattice structure of 
the form $\mathcal{F} = (F, \sqsubseteq, \sqcup, false, true)$ where $\sqsubseteq$ is the partial-order relation, $\sqcup$ is 
the least-upper bound, and false and true are formulas required to be in F 
and correspond to the bottom and top elements of the lattice. Furthermore, we 
assume that whenever $\alpha \sqsubseteq \beta$, then $\alpha \Rightarrow \beta$. Also, we assume that the 
formulas in the lattice are pairwise inequivalent. 

One example of such a formula lattice over the data domain of integers can 
be obtained by taking the set of all possible inequivalent Boolean formulas over 
the atomic formulas involving no constants, defining $\alpha \sqsubseteq \beta$ iff $\alpha \Rightarrow \beta$, and 
taking the least-upper bound of two formulas as the disjunction of them. Such 
a lattice would be of size doubly exponential in the number of variables n, and 
consequently, in practice, we may want to use a different coarser lattice, such as 
the Cartesian formula lattice. The Cartesian formula lattice is formed over a set 
of atomic formulas and consists of conjunctions of literals (atoms or negations 
of atoms). The least-upper bound of two formulas is the conjunction of those 
literals that occur in both formulas. For the ordering we define $\alpha \sqsubseteq \beta$ if all literals 
appearing in $\beta$ also appear in $\alpha$, so that $\alpha \sqsubseteq \beta$ indeed implies $\alpha \Rightarrow \beta$. 
The size of a Cartesian lattice is exponential in the number of literals. 
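
A Cartesian formula lattice is easy to prototype. The sketch below represents a conjunction as a frozenset of literal strings and uses None for the bottom element false; this encoding and the names join and leq are assumptions for illustration. The ordering is oriented so that a formula lying below another one implies it, as required of the lattice above: a conjunction with more literals is stronger and therefore lower.

```python
FALSE = None          # bottom element
TRUE = frozenset()    # empty conjunction, top element

def join(alpha, beta):
    """Least upper bound: keep exactly the literals common to both."""
    if alpha is FALSE:
        return beta
    if beta is FALSE:
        return alpha
    return alpha & beta

def leq(alpha, beta):
    """alpha is below beta iff every literal of beta also occurs in alpha."""
    if alpha is FALSE:
        return True
    if beta is FALSE:
        return False
    return beta <= alpha


a = frozenset({"d(y1) <= d(y2)", "d(y1) <= d(y3)"})
b = frozenset({"d(y1) <= d(y2)", "d(y2) <= d(y3)"})
lub = join(a, b)
```

Here `lub` is the single-literal conjunction d(y1) <= d(y2), which is implied by both `a` and `b`.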

We are now ready to introduce the automaton model. A quantified data 
automaton (QDA) over a set of program variables PV, a data domain D, a 
set of universally quantified variables Y, and a formula lattice $\mathcal{F}$ is of the form 
$A = (Q, q_0, \Pi, \delta, f)$ where Q is a finite set of states, $q_0 \in Q$ is the initial state, 
$\Pi = \Sigma \times (Y \cup \{-\})$, $\delta : Q \times \Pi \rightarrow Q$ is the transition function, and $f : Q \rightarrow F$ is a 
final-evaluation function that maps each state to a data formula. The alphabet 
$\Pi$ used in a QDA does not contain data. Words over $\Pi$ are referred to as 
symbolic words because they do not contain concrete data values. The symbol 
$(b, -)$ indicating that a position does not contain any variable is denoted by b. 

Intuitively, a QDA is a register automaton that reads the data word extended 
by a valuation. It has a register for each $y \in Y$, which stores the data found 
at the position assigned to y, and it checks whether the formula decorating the 
final state reached holds for these registers. It accepts a data word $w \in (\Sigma \times D)^*$ 
if it accepts all possible valuation words v extending w with a valuation over Y. 

We formalize this below. A configuration of a QDA is a pair of the form 
$(q, r)$ where $q \in Q$ and $r : Y \rightarrow D$ is a partial variable assignment. The initial 
configuration is $(q_0, r_0)$ where the domain of $r_0$ is empty. For any configuration 
$(q, r)$, any letter $a \in \Sigma$, data value $d \in D$, and variable $y \in Y$ we define 
$\delta'((q, r), (a, d, y)) = (q', r')$ provided $\delta(q, (a, y)) = q'$ and $r'(y') = r(y')$ for each 
$y' \ne y$ and $r'(y) = d$, and we let $\delta'((q, r), (a, d, -)) = (q', r)$ if $\delta(q, (a, -)) = q'$. 
We extend this function $\delta'$ to valuation words in the natural way. 

A valuation word v is accepted by the QDA if $\delta'((q_0, r_0), v) = (q, r)$ where 
$(q_0, r_0)$ is the initial configuration and $r \models f(q)$, i.e., the data stored in the 
registers in the final configuration satisfy the formula annotating the final state 
reached. We denote the set of valuation words accepted by A as $L_v(A)$. We 
assume that a QDA verifies whether its input satisfies the constraints on the 
number of occurrences of variables from PV and Y, and that all inputs violating 
these constraints either do not admit a run (because of missing transitions) or 
are mapped to a state with final formula false. 

A data word w is accepted by the QDA if for every valuation word v such 
that the data word corresponding to v is w, v is accepted by the QDA. The 
language of the QDA, L(A), is the set of data words accepted by it. 
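
The semantics just defined can be prototyped directly. The following Python sketch is an illustration under stated assumptions: transitions are a dictionary, final formulas are Python predicates over the register map, the blank is the string "-", and the example automaton (sortedness of a list whose first cell is pointed to by a single pointer variable h) is hypothetical and not the automaton of the figure in Section 2.

```python
from itertools import permutations

BLANK = "-"
B = (frozenset(), BLANK)   # the blank letter b = (b, -)

class QDA:
    def __init__(self, q0, delta, final):
        self.q0 = q0        # initial state
        self.delta = delta  # dict: (state, (pointer set, y or BLANK)) -> state
        self.final = final  # dict: state -> predicate over the registers

    def accepts_valuation_word(self, vword):
        q, regs = self.q0, {}
        for X, d, y in vword:
            q = self.delta.get((q, (X, y)))
            if q is None:          # missing transition: sink labeled false
                return False
            if y != BLANK:
                regs[y] = d        # record the data at the position of y
        return self.final[q](regs)

    def accepts_data_word(self, word, Y):
        # Universal semantics: every valuation word extending `word` must pass.
        for positions in permutations(range(len(word)), len(Y)):
            where = dict(zip(positions, Y))
            vword = [(X, d, where.get(j, BLANK)) for j, (X, d) in enumerate(word)]
            if not self.accepts_valuation_word(vword):
                return False
        return True


H, E = frozenset({"h"}), frozenset()
no, yes = (lambda r: False), (lambda r: True)
delta = {
    ("q0", (H, BLANK)): "q1", ("q0", (H, "y1")): "q2", ("q0", (H, "y2")): "q3",
    ("q1", B): "q1", ("q1", (E, "y1")): "q2", ("q1", (E, "y2")): "q3",
    ("q2", B): "q2", ("q2", (E, "y2")): "q4",   # y1 seen first, then y2
    ("q3", B): "q3", ("q3", (E, "y1")): "q5",   # y2 seen first: no constraint
    ("q4", B): "q4", ("q5", B): "q5",
}
final = {"q0": no, "q1": no, "q2": no, "q3": no,
         "q4": lambda r: r["y1"] <= r["y2"], "q5": yes}
sorted_qda = QDA("q0", delta, final)

def as_word(data):
    """Build a data word whose first cell is pointed to by h."""
    return [(H if j == 0 else E, d) for j, d in enumerate(data)]
```

Since every b-transition of this example is a self-loop, it is elastic in the sense of Section 5.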

4 Learning Quantified Data Automata 

Our goal in this section is to synthesize QDAs using existing learning algorithms 
such as Angluin's algorithm [1], which was developed to infer the canonical 
deterministic automaton for a regular language. Therefore, we begin this section 
by analyzing the notion of canonicity for QDAs. 

Recall that QDAs define two kinds of languages, a language of data words 
and a language of valuation words. In Appendix A, we show that on the level 
of data languages we cannot have unique minimal automata. However, on the 
level of valuation words there exists a canonical automaton. This is because the 
automaton model is deterministic and, since all universally quantified variables 
are in different positions, the automaton cannot derive any relation on the data 
values during its run. Formally, we can state the following theorem, under the 
assumption that all formulas in the lattice are pairwise non-equivalent. 

Theorem 1. For each QDA A there is a unique minimal QDA A' that accepts 
the same set of valuation words. 

Proof. Consider a language $L_v$ of valuation words that can be accepted by a 
QDA, and let $w \in \Pi^*$ be a symbolic word. Then there must be a formula in the 
lattice that characterizes precisely the data extensions v of w such that $v \in L_v$. 
Since we assume that all the formulas in the lattice are pairwise non-equivalent, 
this formula is uniquely determined. In fact, take any QDA A that accepts $L_v$. 
Then w leads to some state q in A that outputs some formula $f(q)$. If w leads 
to any other formula in another QDA A', then A' accepts a different language 
of valuation words. 

Thus, a language of valuation words can be seen as a function that assigns to 
each symbolic word a uniquely determined formula, and a QDA can be viewed 
as a Moore machine that computes this function. For each such Moore machine 
there exists a unique minimal one that computes the same function, hence the 
theorem. □ 

As the proof above shows, we can view a language of valuation words as a 
function that maps to each symbolic word a uniquely determined formula, and 
a QDA can be viewed as a Moore machine (an automaton with output function 
on states) that computes this function. 

Our goal is to use existing learning algorithms for Moore machines to learn 
QDAs. To this end, we need to separate the structure of valuation words (the 
length of the words, the cells the pointer variables point to) from the data con- 
tained in the cells of the words. We do so by introducing formula words. 

A formula word over PV, $\mathcal{F}$, and Y is an element of $\Pi^* \times \mathcal{F}$ where, as before, 
$\Pi = \Sigma \times (Y \cup \{-\})$ and each $p \in PV$ and $y \in Y$ occurs exactly once in the 
word. Note that a formula word does not contain elements of the data domain; 
it simply consists of the symbolic word that depicts the pointers into the list 
(modeled using $\Sigma$) and a valuation for the quantified variables in Y (modeled 
using the second component) as well as a formula over the lattice $\mathcal{F}$. For example, 
$((\{h\}, y_1)(b, -)(b, y_2)(\{t\}, -),\ d(y_1) \le d(y_2))$ is a formula word, where h points 
to the first element, t to the last element, $y_1$ points to the first element, and $y_2$ 
to the third element; and the data formula is $d(y_1) \le d(y_2)$. 

By using formula words we explicitly take the view of a QDA as a Moore 
machine that reads symbolic words and outputs data formulas. A formula word 
$(u, \alpha)$ is accepted by a QDA A if A reaches the state q after reading u and 
$f(q) = \alpha$. Hence, a QDA defines a unique language of formula words in the 
usual way. By means of formula words, we can reduce the problem of learning 
QDAs to the problem of learning Moore machines. Next, we briefly sketch the 
learning framework we use for learning QDAs. 

Actively learning QDAs: Angluin [1] introduced a popular learning 
framework in which a learner learns a regular language L, the so-called target language, 
over an a priori fixed alphabet $\Sigma$ by actively querying a teacher which is capable 
of answering membership and equivalence queries. Angluin's algorithm learns a 
regular language in time polynomial in the size of the (unique) minimal 
deterministic finite automaton accepting the target language and the length of the 
longest counterexample returned by the teacher. This algorithm can, however, be 
easily lifted to the learning of Moore machines (see Appendix B for details). 
Membership queries now ask for the output or classification of a word. On an 
equivalence query, the teacher says "yes" or returns a counter-example w such 
that the output of the conjecture on w is different from the output on w in the 
target language. As QDAs can be viewed as Moore machines (a language of formula 
words is just a set of words with data formulas as outputs), we can apply Angluin's algorithm 
directly in order to learn a QDA, and obtain the following theorem. 

Theorem 2. Given a teacher for a QDA-acceptable language of formula words 
that can answer membership and equivalence queries, the unique minimal QDA 
for this language can be learned in time polynomial in the size of this minimal QDA and 
the length of the longest counterexample returned by the teacher. 
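
The following Python sketch illustrates Angluin-style learning lifted to Moore machines. It is not the construction of Appendix B: it uses the well-known variant in which all suffixes of a counterexample are added as distinguishing columns, which keeps the rows of the access words pairwise distinct and avoids a separate consistency check. The names learn_moore, member and equiv, the toy target, and the bounded equivalence check of the example teacher are all assumptions made for illustration.

```python
from itertools import product

def learn_moore(alphabet, member, equiv):
    """Learn a Moore machine from output queries and equivalence queries."""
    S, E = [()], [()]   # access words; distinguishing suffixes (E[0] is empty)
    cache = {}

    def out(u):         # membership query, memoized
        if u not in cache:
            cache[u] = member(u)
        return cache[u]

    def row(u):
        return tuple(out(u + e) for e in E)

    while True:
        # Close the table: each one-letter extension must match a row of S.
        changed = True
        while changed:
            changed = False
            rows = {row(s) for s in S}
            for s in list(S):
                for a in alphabet:
                    r = row(s + (a,))
                    if r not in rows:
                        S.append(s + (a,))
                        rows.add(r)
                        changed = True
        # Build the conjecture; states are named by their access words.
        state_of = {row(s): s for s in S}
        delta = {(s, a): state_of[row(s + (a,))] for s in S for a in alphabet}
        output = {s: out(s) for s in S}

        def hyp(word, delta=delta, output=output):
            q = ()
            for a in word:
                q = delta[(q, a)]
            return output[q]

        cex = equiv(hyp)
        if cex is None:
            return S, delta, output
        for k in range(len(cex)):       # add every non-empty suffix to E
            if cex[k:] not in E:
                E.append(cex[k:])


# Toy target: the output depends on the number of a's modulo 3.
OUT = ["true", "true", "d(y1) <= d(y2)"]

def member(u):
    return OUT[u.count("a") % 3]

def equiv(hyp):
    # Approximate teacher: compares outputs on all words up to length 6.
    for n in range(7):
        for w in product("ab", repeat=n):
            if hyp(w) != member(w):
                return w
    return None

states, delta, output = learn_moore(("a", "b"), member, equiv)
```

On this target the first conjecture has a single state; the counterexample aa then splits it into the three states of the minimal machine.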

5 Unique Over-approximation Using Elastic QDAs 

Our aim is to translate the synthesized QDAs into decidable logics such 
as the decidable fragment of Strand or the array property fragment. A property 
shared by both logics is that they cannot test whether two universally quantified 
variables are a bounded distance apart. We capture this type of constraint by the 
subclass of elastic QDAs (EQDAs) that have been already informally described 
in Section 2. Formally, a QDA A is called elastic if each transition on b is a self 
loop, that is, whenever $\delta(q, b) = q'$ is defined, then $q = q'$. 

The learning algorithm that we use to synthesize QDAs does not construct
EQDAs in general. However, we can show the following surprising result: every
QDA A can be uniquely over-approximated by a language of valuation words that
can be accepted by an EQDA $A_{el}$. We will refer to this construction, which
we outline below, as elastification. This construction crucially relies on
the particular structure that elastic automata have, which forces a unique set
of words to be added to the language in order to make it elastic.

Let $A = (Q, q_0, \Pi, \delta, f)$ be a QDA and for a state $q$ let
$R_b(q) := \{q' \mid q \xrightarrow{b^*} q'\}$ be the set of states reachable
from $q$ by a (possibly empty) sequence of b-transitions. For a set
$S \subseteq Q$ we let $R_b(S) := \bigcup_{q \in S} R_b(q)$.

The set of states of $A_{el}$ consists of sets of states of A that are reachable
from the initial state $R_b(q_0)$ of $A_{el}$ by the following transition
function (where $\delta(S, a)$ denotes the standard extension of the transition
function of A to sets of states):

$$\delta_{el}(S, a) = \begin{cases}
R_b(\delta(S, a)) & \text{if } a \neq b \\
S & \text{if } a = b \text{ and } \delta(q, b) \text{ is defined for some } q \in S \\
\text{undefined} & \text{otherwise.}
\end{cases}$$

Note that this construction is similar to the usual powerset construction except
that in each step we apply the transition function of A to the current set of
states and take the b-closure. However, if the input letter is b, $A_{el}$ loops
on the current set if a b-transition is defined for some state in the set.

The final evaluation formula for a set is the least upper bound of the formulas
for the states in the set: $f_{el}(S) = \bigsqcup_{q \in S} f(q)$. We can now
show that $A_{el}$ is the most precise over-approximation of the language of
valuation words accepted by the QDA A.
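The elastification construction can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that are not part of the construction itself: the QDA is given as a dictionary `delta` from (state, letter) pairs to states, the blank letter is called `"b"`, a formula is represented as a set of atomic conjuncts (with `None` standing for false) so that the least upper bound becomes set intersection, and a non-blank letter on which no state of the current set has a transition is treated as undefined rather than as a transition to the empty set.

```python
BLANK = "b"  # assumed name of the blank letter


def b_closure(states, delta):
    """R_b(S): all states reachable from S by zero or more b-transitions."""
    seen = set(states)
    stack = list(seen)
    while stack:
        q = stack.pop()
        nxt = delta.get((q, BLANK))
        if nxt is not None and nxt not in seen:
            seen.add(nxt)
            stack.append(nxt)
    return frozenset(seen)


def join(formulas):
    """Toy least upper bound: None is false, otherwise intersect conjunct sets."""
    sets = [phi for phi in formulas if phi is not None]
    if not sets:
        return None
    return frozenset.intersection(*sets)


def elastify(q0, alphabet, delta, f):
    """Return (initial state, delta_el, f_el) of the elastic automaton A_el."""
    init = b_closure({q0}, delta)
    delta_el, seen, worklist = {}, {init}, [init]
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            if a == BLANK:
                # Loop on S if some state in S has a b-transition.
                if any((q, BLANK) in delta for q in S):
                    delta_el[(S, a)] = S
                continue
            step = {delta[(q, a)] for q in S if (q, a) in delta}
            if not step:
                continue  # assumption: treated as undefined
            T = b_closure(step, delta)
            delta_el[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    f_el = {S: join(f[q] for q in S) for S in seen}
    return init, delta_el, f_el


# Hypothetical non-elastic QDA: the b-transition from state 1 leads to state 2.
delta = {(0, "y1"): 1, (1, "b"): 2, (1, "y2"): 4, (2, "y2"): 3}
f = {0: None, 1: None, 2: None,
     3: frozenset({"d(y1)<=d(y2)"}),
     4: frozenset({"d(y1)<=d(y2)", "d(y1)<d(y2)"})}
init, delta_el, f_el = elastify(0, ["y1", "y2", "b"], delta, f)
```

In this toy run the states 1 and 2 are merged into one set with a self-loop on b, and the output of the merged final set keeps only the conjunct shared by both original outputs.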

Theorem 3. For every QDA A, the EQDA $A_{el}$ satisfies
$L_v(A) \subseteq L_v(A_{el})$, and for every EQDA B such that
$L_v(A) \subseteq L_v(B)$, $L_v(A_{el}) \subseteq L_v(B)$ holds.

Proof: Note that $A_{el}$ is elastic by definition of $\delta_{el}$. It is also
clear that $L_v(A) \subseteq L_v(A_{el})$ because for each run of A using states
$q_0 \cdots q_n$ the run of $A_{el}$ on the same input uses sets
$S_0 \cdots S_n$ such that $q_i \in S_i$, and by definition $f(q_n)$ implies
$f_{el}(S_n)$.

Now let B be an EQDA with $L_v(A) \subseteq L_v(B)$. Let
$w = (a_1, d_1) \cdots (a_n, d_n) \in L_v(A_{el})$ and let S be the state of
$A_{el}$ reached on w. We want to show that $w \in L_v(B)$. Let p be the state
reached in B on w. We show that $f(q)$ implies $f_B(p)$ for each $q \in S$. From
this we obtain $f_{el}(S) \Rightarrow f_B(p)$ because $f_{el}(S)$ is the least
formula that is implied by all the $f(q)$ for $q \in S$.

Pick some state $q \in S$. By definition of $\delta_{el}$ we can construct a
valuation word $w'$ that leads to the state q in A and has the following
property: if all letters of the form $(b, d)$ are removed from w and from $w'$,
then the two remaining words are the same. In other words, w and $w'$ can be
obtained from each other by inserting and/or removing b-letters.

Since B is elastic, $w'$ also leads to p in B. From this we can conclude that
$f(q) \Rightarrow f_B(p)$ because otherwise there would be a model of $f(q)$
that is not a model of $f_B(p)$, and by changing the data values in $w'$
accordingly we could produce an input that is accepted by A and not by B. □

6 Linear Data-structures to Words and EQDAs to Logics 



In this section, we sketch briefly how to model arrays and lists as data-words, 
and how to convert EQDAs to quantified logical formulas in decidable logics. 



Modeling Lists and Arrays as Data Words 

We model a linear data structure as a word over $(\Sigma \times D)$ with
$\Sigma = 2^{PV}$, where PV is the set of pointer variables and D is the data
domain; scalar variables in the program are modeled as single-element lists. The
encoding introduces a special pointer variable nil which is always read in the
beginning of the data word together with all other null-pointers in the
configuration. For arrays, the encoding also introduces nil_le_zero and
nil_geq_size which are read together with all those index variables which are
less than zero or which exceed the size of the respective array. The data value
at these variables is not important; they can be populated with any data value
in D. Given a configuration, the corresponding data words read the scalar
variables and the linear data structures one after the other, in some
pre-determined order. In programs like copying one array to another, where both
the arrays are read synchronously, the encoding models multiple data structures
as a single structure over an extended data domain.
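As a rough illustration of this encoding for a single list, the following Python sketch turns a list of data values and a pointer assignment into a data word. The function name `encode_list`, the representation of pointers as indices (with `None` for null), and the dummy data value 0 at the nil position are assumptions made for this example only.

```python
def encode_list(values, pointers, dummy=0):
    """Encode a list as a word over (2^PV x D).

    values   -- data values stored in the list nodes, in list order
    pointers -- maps each pointer variable to a node index, or None if null
    dummy    -- arbitrary data value read at the nil position
    """
    # nil is read first, together with all null pointers of the configuration.
    nulls = frozenset(p for p, i in pointers.items() if i is None) | {"nil"}
    word = [(nulls, dummy)]
    for index, data in enumerate(values):
        # Pointers referring to this node; the empty set plays the role of b.
        here = frozenset(p for p, i in pointers.items() if i == index)
        word.append((here, data))
    return word


word = encode_list([3, 5, 8], {"head": 0, "cur": None})
```

Here `word` starts with the letter for nil and cur, followed by one letter per list node; nodes that no pointer variable refers to carry the empty set.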

From EQDAs to Strand and APF 

Now we briefly sketch the translation from an EQDA A to an equivalent formula
T(A) in Strand or the APF such that the set of data words accepted by A
corresponds to the program configurations C which model T(A).

Given an EQDA A, the translation enumerates all simple paths in the automaton
to an output state. For each such path p from the initial state to an output
state $q_p$, the translation records the relative positions of the pointer and
universal variables as a structural constraint $\varphi_p$ and the formula
$f_A(q_p)$ relating the data values at these positions. Each path thus leads to
a universally quantified implication of the form
$\forall Y.\ \varphi_p \Rightarrow f_A(q_p)$. All valuation words not accepted
by the EQDA semantically go to the formula false, hence an additional conjunct
$\forall Y.\ \neg(\bigvee_p \varphi_p) \Rightarrow false$ is added to the
formula. So the final formula is
$T(A) = \bigwedge_p \forall Y.\ \varphi_p \Rightarrow f_A(q_p) \;\wedge\; \forall Y.\ \neg(\bigvee_p \varphi_p) \Rightarrow false$.
See Appendix C for more details.



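[Figure: a single path of the automaton with self-loops on b; its output
formula is $\varphi := d(y_1) \le d(y_2) \wedge d(y_1) < k \wedge d(y_2) < k$.]

Fig. 1: A path in the automaton expressing the invariant of the program which
finds a key k in a sorted list. The full automaton is presented in Appendix E.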

We next explain, through an example, the construction of the structural
constraints $\varphi_p$. Consider the program list-find, which searches for a
key in a sorted list. The EQDA corresponding to the loop invariant learned for
this program is presented in Appendix E. One of the simple paths in the
automaton (along with the associated self-loops on b) is shown in Fig. 1. The
structural constraint $\varphi_p$ intuitively captures all valuation words which
are accepted by the automaton along p; for the path in the figure, $\varphi_p$
is $(cur = nil \wedge h \rightarrow^+ y_1 \wedge y_1 \rightarrow^+ y_2)$ and the
formula
$\forall y_1 y_2.\ (cur = nil \wedge h \rightarrow^+ y_1 \wedge y_1 \rightarrow^+ y_2) \Rightarrow (d(y_1) \le d(y_2) \wedge d(y_1) < k \wedge d(y_2) < k)$
is the corresponding conjunct in the learned invariant.







Applying this construction yields the following theorem. 

Theorem 4. Let A be an EQDA, w a data word, and c the program configuration
corresponding to w. If $w \in L(A)$, then $c \models T(A)$. Additionally, if
T(A) is a Strand formula, then the implication also holds in the opposite
direction.

APF allows the universal variables to be related by ≤ or = and not <. Hence,
along paths where $y_1 < y_2$, we over-approximate the structural constraint
$\varphi_p$ to $y_1 \le y_2$ and, subsequently, the data formula is abstracted
to include $d(y_1) = d(y_2)$. This leads to an abstraction of the actual
semantics of the QDA and is the reason Theorem 4 only holds in one direction for
the APF.

7 Implementation and Evaluation on Learning Invariants 

We apply the active learning algorithm for QDAs, described in Section 4, in a
passive learning framework in order to learn quantified invariants over lists
and arrays from a finite set of samples S obtained from dynamic test runs.

Implementing the teacher. In an active learning algorithm, the learner can
query the teacher for membership and equivalence queries. In order to build a
passive learning algorithm from a sample set S, we build a teacher, who will use
S to answer the questions of the learner, ensuring that the learned set contains
S.

The teacher knows S and wants the learner to construct a small automaton
that includes S; however, the teacher does not have a particular language of
data words in mind, and hence cannot answer questions precisely. We build a
teacher who answers queries as follows. On a membership query for a word w,
the teacher checks whether w belongs to S and returns the corresponding data
formula. The teacher has no knowledge about the membership of words that
were not realized in test runs, and she rejects these. By doing this, the
teacher errs on the side of keeping the invariant semantically small. On an
equivalence query, the teacher just checks that the set of samples S is
contained in the conjectured invariant. If not, the teacher returns a
counterexample from S. Note that the passive learning algorithm hence
guarantees that the language of the learned automaton will be a superset of S
and that learning will take time polynomial in the size of the learned
automaton. We show the efficacy of this passive learning algorithm using
experimental evidence.
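The behaviour of such a sample-based teacher can be sketched as follows. This is a hedged illustration, not the implementation used in the experiments: the class name `SampleTeacher`, the string representation of data formulas, and the toy implication order `ORDER` are assumptions made for this example.

```python
# Toy chain of data formulas, ordered from strongest to weakest (an assumption).
ORDER = ["false", "d(y1)<d(y2)", "d(y1)<=d(y2)", "true"]


def implies(phi, psi):
    """phi implies psi in the toy chain above."""
    return ORDER.index(phi) <= ORDER.index(psi)


class SampleTeacher:
    """Answers queries using only the sample S (symbolic word -> data formula)."""

    def __init__(self, sample):
        self.sample = dict(sample)

    def membership(self, word):
        # Words not realized in any test run are rejected (formula false).
        return self.sample.get(tuple(word), "false")

    def equivalence(self, conjecture):
        # Check that every sample is contained in the conjecture; return a
        # counterexample from S if not, and None to signal "yes".
        for word, phi in self.sample.items():
            if not implies(phi, conjecture(word)):
                return word
        return None


teacher = SampleTeacher({("head", "y1", "y2"): "d(y1)<=d(y2)"})
too_strong = lambda word: "d(y1)<d(y2)"   # hypothetical conjecture
weak_enough = lambda word: "true"         # hypothetical conjecture
```

A conjecture whose output is strictly stronger than the sampled formula is refuted with the sample word itself, while any conjecture that covers all samples is accepted, even if it is far weaker than necessary.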

Implementation of a passive learner of invariants. We first take a program
and, using a test suite, extract the set of concrete data structures that
manifest at loop headers. The test suite was generated by enumerating all
possible arrays/lists of a small bounded length, with data values from a
small bounded domain. We then convert the data structures into a set of formula
words, as described below, to get the set S on which we perform passive
learning. We first fix the formula lattice T over data formulas to be the
Cartesian lattice of atomic formulas over the relations {=, <, ≤}. This is
sufficient to capture the invariants of many interesting programs such as
sorting routines, searching a list, in-place reversal of sorted lists, etc.
Using lattice T, for every program configuration which was realized in some test
run, we generate a formula word for every valuation of the universal variables
over the program structures. We represent these formula words as a mapping from
the symbolic word, encoding the structure, to a data formula in the lattice T.
If different inputs realize the same structure but with different data formulas,
we associate the symbolic word with the join of the two formulas.
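The join of data formulas for a shared symbolic word can be illustrated with a small Python sketch. It assumes one simplified reading of the Cartesian lattice: each ordered pair of terms carries one of the values false, =, <, ≤ (written `"<="`), or true, and joins are taken componentwise. The helper names `atom`, `join_rel`, `formula_of`, and `add_observation` are hypothetical.

```python
from itertools import permutations


def atom(a, b):
    """Strongest relation among =, < that holds between two data values."""
    if a == b:
        return "="
    return "<" if a < b else "true"


def join_rel(r, s):
    """Least upper bound in the chain-like lattice false, {=, <}, <=, true."""
    if r == s:
        return r
    if r == "false":
        return s
    if s == "false":
        return r
    if "true" in (r, s):
        return "true"
    return "<="  # two different relations among =, <, <=


def formula_of(data):
    """Data formula for one observation: one relation per ordered term pair."""
    return {(s, t): atom(data[s], data[t])
            for s, t in permutations(sorted(data), 2)}


def add_observation(sample, word, data):
    """Record an observation, joining with any formula already stored."""
    phi = formula_of(data)
    old = sample.get(word)
    sample[word] = phi if old is None else {
        pair: join_rel(old[pair], phi[pair]) for pair in phi}


sample = {}
word = ("head", "y1", "y2")
add_observation(sample, word, {"y1": 1, "y2": 2})  # observed d(y1) < d(y2)
add_observation(sample, word, {"y1": 3, "y2": 3})  # observed d(y1) = d(y2)
```

After both observations the component for the pair (y1, y2) is the join of < and =, which is ≤, while the reverse pair carries no constraint.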

Implementing the learner. We used the libALF library [22] as an implementation
of the active learning algorithm [1]. We adapted its implementation to our
setting by modeling QDAs as Moore machines. If the learned QDA is not elastic,
we elastify it as described in Section 5. The result is then converted to a
quantified formula over Strand or the APF and we check if the learned invariant
was adequate using a constraint solver.



Example            #Test    T_teacher  #Eq.  #Mem.   Size       Elastification  T_learn
                   inputs   (s)                      (#states)  required?       (s)
array-find         310      0.05       2     121     8          no              0.00
array-copy         7380     1.75       2     146     10         no              0.00
array-comp         7380     0.51       2     146     10         no              0.00
ins-sort-outer     363      0.19       3     305     11         no              0.00
ins-sort-inner     363      0.30       7     2893    23         yes             0.01
sel-sort-outer     363      0.18       3     306     11         no              0.01
sel-sort-inner     363      0.55       9     6638    40         yes             0.05
list-find          111      0.04       6     1683    15         yes             0.01
list-insert        111      0.04       3     1096    20         no              0.01
list-init          310      0.07       5     879     10         yes             0.01
list-max           363      0.08       7     1608    14         yes             0.00
list-merge         5004     10.50      7     5775    42         no              0.06
list-partition     16395    11.40      10    11807   38         yes             0.11
list-reverse       27       0.02       2     439     18         no              0.00
list-bubble-sort   363      0.19       3     447     12         no              0.01
list-fold-split    1815     0.21       2     287     14         no              0.00
list-quick-sort    363      0.03       1     37      5          no              0.00
list-init-cmplx    363      0.05       1     57      6          no              0.01
lookup_prev        111      0.04       3     1096    20         no              0.01
add_cachepage      716      0.19       2     500     14         no              0.01
sort (merge)       363      0.04       1     37      5          no              0.00
insert_sorted      111      0.04       2     530     15         no              0.01
devres             372      0.06       2     121     8          no              0.00
rm_pkey            372      0.06       2     121     8          no              0.00
Learning Function Pre-conditions
list-find          111      0.01       1     37      5          no              0.00
list-init          310      0.02       1     26      4          no              0.00
list-merge         329      0.06       3     683     19         no              0.01

Table 1. Results of our experiments.



Experimental Results¹. We evaluate our approach on a suite of programs
(see Table 1) for learning invariants and preconditions. For every program, we
report the number of test inputs and the time (T_teacher) taken to build the
teacher from the samples collected along these test runs. We also report the
number of equivalence and membership queries answered by the teacher in the
active learning algorithm, the size of the final elastic automata, whether the
learned QDA required any elastification and, finally, the time (T_learn) taken
to learn the QDA.

¹ Our prototype implementation along with the results for all our programs can
be found at http://automata.rwth-aachen.de/~neider/learning_qda/

The array programs are programs for finding a key in an array, copying and 
comparing two arrays, and inner and outer loops of insertion and selection sort 
over an array. The list programs find and insert a key in a sorted list, initialize a 
list, return the maximum data value in a list, merge two disjoint lists, partition a 
list into two lists depending on a predicate and reverse in-place a sorted list. The 
programs bubble-sort, fold-split and quick-sort are taken from [10]. The program 
list-init-cmplx sorts an input array using heap-sort and then initializes a list 
with the contents of this sorted array. Since heap-sort is a complex algorithm 
that views an array as a binary tree, none of the current automatic white-box 
techniques for invariant synthesis can handle such complex programs. However, 
our learning approach being black-box, we are able to learn the correct invariant, 
which is that the list is sorted. Similarly synthesizing post-condition annotations 
for recursive procedures like merge-sort and quick-sort is in general difficult for 
white-box techniques, like interpolation, which require a post-condition. 

The methods lookup_prev and add_cachepage are from the module cachePage
in a verified-for-security platform for mobile applications [23]. The module
cachePage maintains a cache of the recently used disc pages as a priority queue
based on a sorted list. The method sort is a merge sort implementation and
insert_sorted is a method for insertion into a sorted list. Both these methods
are from Glib, which is a low-level C library that forms the basis of the GTK+
toolkit and the GNOME environment. The methods devres² and rm_pkey³ are methods
from the Linux kernel and an InfiniBand device driver, both mentioned in [6].

All experiments were completed on an Intel Core i5 CPU at 2.4GHz with
6GB of RAM. For all examples, our prototype implementation learns an adequate
invariant quickly: the learning time T_learn reported in Table 1 is at most
0.11 s. Though the learned QDA might not be the smallest automaton representing
the samples S (because of the inaccuracies of the teacher), in practice we find
that they are reasonably small (less than 50 states). Moreover, we verified that
the learned invariants were adequate for proving the programs correct by
generating verification conditions and validating them using an SMT solver
(these verified in less than 1 s). Learned invariants are complex in some
programs; for example, the invariant QDA for the program list-find is presented
in Appendix E and corresponds to:

$head \neq nil \wedge (\forall y_1 y_2.\ head \rightarrow^* y_1 \rightarrow^* y_2 \Rightarrow d(y_1) \le d(y_2)) \wedge ((cur = nil \wedge \forall y_1.\ head \rightarrow^* y_1 \Rightarrow d(y_1) < k) \vee (head \rightarrow^* cur \wedge \forall y_1.\ head \rightarrow^* y_1 \rightarrow^+ cur \Rightarrow d(y_1) < k))$.

Future Work: We believe that learning of structural conditions of
data-structure invariants using automata is an effective technique, especially
for quantified properties where passive or machine-learning techniques are not
currently known. However, for the data formulas themselves, machine learning can
be very effective [18], and we would like to explore combining automata-based
structural learning (for words and trees) with machine learning for data
formulas.

² method pcim_iounmap in Linux kernel at linux/lib/devres.c

³ from InfiniBand device driver at drivers/infiniband/hw/ipath/ipath_mad.c
Appendix A



We show, through an example, that on the level of data languages we cannot
hope for a unique minimal QDA. Consider the QDA in Figure 2 over
$PV = \emptyset$ and $Y = \{y_1, y_2\}$. It accepts all valuation words in which
$d(y_1) \le d(y_2)$ if $y_1$ is before $y_2$ and $y_1, y_2$ are both on even
positions, and all valuation words in which $y_2 < y_1$ or at least one of
$y_1, y_2$ is not on an even position. Hence, the data language defined by this
QDA consists of all data words such that the data on the even list positions is
sorted. Since the QDA has to check that each variable occurs exactly once, the
number of states is minimal for defining this data language.

However, the same data language would also be defined if the b-transition
from $q_3$ were redirected to $q_5$. Then the sortedness property would only be
checked for all $y_1, y_2$ with $y_2 = y_1 + 2$, which is sufficient. This shows
that the transition structure of a state-minimal QDA for a given data language
is not unique.




Fig. 2: A QDA expressing the property that the data on the even positions in 
the list is sorted. 



Appendix B 

Angluin [1] introduced a popular learning framework, which was originally
designed to learn regular languages. In this framework, a learner (or learning
algorithm) learns a regular language L, the so-called target language, over an
a priori fixed alphabet Σ by actively querying a teacher. The teacher is capable
of answering two different kinds of queries: membership and equivalence queries.
On a membership query, the learner presents a word $w \in \Sigma^*$, and the
teacher replies "yes" or "no" depending on whether w belongs to L or not. On an
equivalence query, the learner conjectures a regular language
$H \subseteq \Sigma^*$, typically given as a finite automaton, and the teacher
checks whether H is an equivalent description of L. If this is the case, the
teacher replies "yes". Otherwise, the teacher returns a counterexample w with
$w \in L \Leftrightarrow w \notin H$.

In [1], Angluin presented a learning algorithm that learns a regular language
in time polynomial in the size of the (unique) minimal deterministic finite
automaton accepting the target language and the length of the longest
counterexample returned by the teacher. Angluin's algorithm maintains a
prefix-closed set $S \subseteq \Sigma^*$, a suffix-closed set
$E \subseteq \Sigma^*$, and stores the learned data in a table (realized as a
mapping $T\colon (S \cup S\Sigma)E \to \{0, 1\}$), whose rows are labeled with
strings from S and whose columns are labeled with strings from E. The key idea
of the algorithm is to approximate the Nerode congruence of the target language
using strings from S as representatives for the equivalence classes and strings
from E as samples to distinguish these classes. New strings are added to S and
E whenever necessary until an equivalence query reveals that the conjectured
automaton is equivalent to the target language.

Although originally introduced to learn regular languages, this algorithm can
be easily lifted to the learning of Moore machines. In this setting, the "target
language" is a finite-state computable mapping
$\lambda\colon \Sigma^* \to \Gamma$ (i.e., a mapping computable by a Moore
machine) that maps each word $w \in \Sigma^*$ to some output $\lambda(w)$ taken
from a finite set Γ of output symbols. (We obtain Angluin's setting for
$\Gamma = \{0, 1\}$.) Moreover, membership queries now ask for the output, or
classification, of a word rather than whether it belongs to a language or not.
Finally, on an equivalence query, the learner proposes a Moore machine M. If M
is not equivalent to the target language, the teacher returns a counterexample
w such that the output of M on w is different from $\lambda(w)$.

Adapting Angluin's algorithm to work with Moore machines is straightforward.
Since the Nerode congruence can easily be lifted to the Moore machine setting,
it is indeed enough to change the table to a mapping
$T\colon (S \cup S\Sigma)E \to \Gamma$; everything else can be left unchanged.
This adapted algorithm also learns the minimal Moore machine for the target
language in time polynomial in the size of this minimal Moore machine and the
length of the longest counterexample returned by the teacher.
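The adapted algorithm can be sketched compactly in Python. This is an illustrative sketch of Angluin-style table learning with outputs in Γ, not the libALF implementation; the function names, the encoding of words as tuples, the convention that the teacher returns `None` for "yes", and the bounded brute-force equivalence check in the toy teacher are all assumptions made for this example.

```python
from itertools import product


def learn_moore(alphabet, membership, equivalence):
    """Learn a Moore machine; returns (hypothesis function, number of states)."""
    S, E, T = [()], [()], {}  # access words, experiments, observation table

    def row(s):
        return tuple(T[s + e] for e in E)

    def fill():
        for s in S + [s + (a,) for s in S for a in alphabet]:
            for e in E:
                if s + e not in T:
                    T[s + e] = membership(s + e)

    def find_unclosed():
        rows = {row(s) for s in S}
        for s in S:
            for a in alphabet:
                if row(s + (a,)) not in rows:
                    return s + (a,)
        return None

    def find_inconsistency():
        for s1, s2 in product(S, repeat=2):
            if s1 != s2 and row(s1) == row(s2):
                for a in alphabet:
                    for e in E:
                        if T[s1 + (a,) + e] != T[s2 + (a,) + e]:
                            return (a,) + e
        return None

    while True:
        fill()
        new_row = find_unclosed()
        if new_row is not None:
            S.append(new_row)
            continue
        new_exp = find_inconsistency()
        if new_exp is not None:
            E.append(new_exp)
            continue
        reps = {}
        for s in S:
            reps.setdefault(row(s), s)  # one representative per row
        delta = {(r, a): row(reps[r] + (a,)) for r in reps for a in alphabet}
        out = {r: r[0] for r in reps}   # E[0] is the empty word
        init = row(())

        def hyp(word):
            q = init
            for a in word:
                q = delta[(q, a)]
            return out[q]

        cex = equivalence(hyp)
        if cex is None:
            return hyp, len(reps)
        for i in range(1, len(cex) + 1):  # add all prefixes of the counterexample
            if cex[:i] not in S:
                S.append(cex[:i])


# Toy target: the output names the number of a's modulo 3.
ALPHABET = ("a", "b")
target = lambda w: "phi%d" % (w.count("a") % 3)


def bounded_equivalence(hyp, max_len=5):
    for n in range(max_len + 1):
        for w in product(ALPHABET, repeat=n):
            if hyp(w) != target(w):
                return w
    return None


hyp, n_states = learn_moore(ALPHABET, target, bounded_equivalence)
```

The only change relative to the table for regular languages is that table entries are outputs from Γ rather than bits; closedness and consistency are checked exactly as before.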

Appendix C 

In this appendix we describe, in greater detail, the translation from an EQDA
A to a formula T(A) expressed in Strand or the APF such that the set of data
words accepted by A corresponds to the program configurations C which model
T(A).

Recall the formal definition of an EQDA from Section 5. In an EQDA
$A = (Q, q_0, \Pi, \delta, f)$ over program variables PV and universal variables
Y, each transition on b is a self-loop. Without restricting the class of
languages accepted by A we assume, for the purpose of translation, that our
EQDAs have three additional properties.

Firstly, we assume that any path in the EQDA, along which a universal
variable occurs together with auxiliary variables like nil which are introduced
by the encoding from Section 6, goes to the formula true. This does not change
the language accepted by the automaton as it still accepts all data words
respecting the formula constraints for other valuations of the universal
variables and where the data at these auxiliary variables can be any data value.



[Figure: panels (a) and (b), each showing a transition on (b, y) between states
with self-loops on b.]

Fig. 3: Base cases for detecting irrelevant self-loops.



Secondly, we assume that the EQDA has no irrelevant loops, which are defined
inductively as follows: fix a simple path p of an EQDA A that leads from the
initial to an accepting state, and on p consider a state $q_1$ (see Fig. 3(a))
which reads a universal variable (b, y) on a transition to state $q_2$. If $q_2$
has a self-loop on the blank symbol, i.e. $\delta(q_2, b) = q_2$, then this loop
is inductively defined to be irrelevant on p if $q_1$ has no self-loop, or if
the self-loop at $q_1$ is also irrelevant on p. Symmetrically, a self-loop at
$q_1$ is irrelevant on p if $q_2$ has no self-loop or has one which is
irrelevant on p (see Fig. 3(b)). If a self-loop is irrelevant on p, then it can
be omitted for words accepted along the path p. To see why, consider a valuation
word $v = \ldots (b, y)\, b \ldots$ that is accepted along p using the self-loop
in Figure 3(a). A different valuation $v' = \ldots b\, (b, y) \ldots$ of the
same data word is rejected by A since $q_1$ has no transition on b. Hence, the
data word corresponding to v is not accepted by A. We can remove irrelevant
loops from A without changing the accepted data language by simply removing
those loops that are irrelevant on each path they occur on, or by splitting
states if they have a self-loop that is irrelevant only on some paths.

Thirdly, we assume that the universal variables are read by the EQDA in a 
particular order and all paths in the EQDA that do not respect this order lead 
to the formula true. The translation that we give below considers each path of 
the automaton separately. Thus, if the automaton does not satisfy the above 
property, then for any path that does not read the variables in the correct order 
we rename the variables on the transitions and in the data formulas along that 
path accordingly before the translation. 

Let us now turn to the translation of the paths. We observe that all variables
appear exactly once in any valuation word accepted by A. Since we disallow
universal variables to appear together, this is ensured by adding some dummy
symbols where these variables can appear in case the valuation word is too
short. A consequence of this property is that there can be no cycle in our EQDA
model which shares an edge labeled with a (universal or pointer) variable.
Consider a simple path p of the automaton from the initial state $q_0$ to the
output state $q_p$,
$q_0 \xrightarrow{\pi_0} \cdots \xrightarrow{\pi_{n-1}} q_p$ (with
$\pi_i \in \Pi$, $\pi_i \neq b$). Below we informally describe the translation T
from path p to a formula $\varphi_p$ which captures the relative positions of
the pointer and universal variables along p and forms the guard of a universally
quantified implication in a conjunct of the translated formula. At a high level,
whenever a state q in path p has a self-loop on the blank symbol b, pointers and
universal variables $v_1, v_2 \in PV \cup Y$ labeled along the incoming and
outgoing transitions of this state are constrained by the relation $v_1 < v_2$
or $v_1 \rightarrow^+ v_2$. The presence of a self-loop ensures that the
variables are related by an elastic relation, which is required for decidability
in Strand and APF. On the other hand, if q has no transition on b, then the
pointers labeled along the incoming and outgoing transitions are constrained by
the successor relation. Note that successor is an inelastic relation and is not
allowed to relate two universal variables. In this case we identify a state $q'$
on path p, closest to q, which has a transition on some pointer (non-universal)
variable pv. Since we have already stripped our EQDA of all irrelevant loops,
the subpath from $q'$ to q has no self-loops. Thus, the universal variables at q
can be constrained to be a fixed distance away from the pointer pv. This is
allowed in APF using arithmetic on the pointer variables. For Strand, the same
effect can be obtained by introducing a monadic predicate which tracks the
distance of the universal variable from the pointer variable pv.
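The path-wise construction of structural constraints can be illustrated with a simplified Python sketch. It makes several assumptions beyond the text: each non-blank letter names a single variable, the EQDA is a dictionary `delta` from (state, letter) pairs to states, constraints are rendered as plain strings, and the special handling of fixed distances from a pointer variable is reduced to a `next(...)` atom. The function names are hypothetical.

```python
BLANK = "b"  # assumed name of the blank letter


def simple_paths(q0, delta, outputs):
    """Enumerate simple paths (ignoring b self-loops) from q0 to output states.

    Returns a list of (letters, states) pairs, where states[i + 1] is the state
    reached after reading letters[i].
    """
    paths = []

    def dfs(q, letters, states):
        if q in outputs and letters:
            paths.append((tuple(letters), tuple(states)))
        for (src, a), dst in delta.items():
            if src == q and a != BLANK and dst not in states:
                dfs(dst, letters + [a], states + [dst])

    dfs(q0, [], [q0])
    return paths


def structural_constraint(letters, states, delta):
    """Relate consecutive variables on a path.

    A b self-loop on the state between two variables yields the elastic
    relation ->+, and its absence yields the successor relation.
    """
    atoms = []
    for i in range(len(letters) - 1):
        middle = states[i + 1]  # state between letters[i] and letters[i + 1]
        if (middle, BLANK) in delta:
            atoms.append("%s ->+ %s" % (letters[i], letters[i + 1]))
        else:
            atoms.append("next(%s) = %s" % (letters[i], letters[i + 1]))
    return " & ".join(atoms)


# Hypothetical EQDA fragment: y1 directly follows head, y2 comes later.
delta = {(0, "head"): 1, (1, "y1"): 2, (2, BLANK): 2, (2, "y2"): 3}
paths = simple_paths(0, delta, outputs={3})
phi = structural_constraint(paths[0][0], paths[0][1], delta)
```

In this fragment the universal variable y1 sits at a fixed distance from the pointer head, while the self-loop between y1 and y2 gives the elastic relation between the two universal variables.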

We skip a formal description of the translation. A subtle case to note,
however, is when a state q in path p has a self-loop on the blank symbol b and
the incoming and outgoing transitions on q are both labeled by letters of the
form (b, y) where $y \in Y$. Unlike Strand, APF forbids two adjacent universal
variables $y_1, y_2$ to be related by <. And so for the case of arrays, the
translation T(p) constrains these universal variables as $y_1 \le y_2$.
Moreover, we modify the output of the final state along this path, $f_A(q_p)$,
to include the data constraint $d(y_1) = d(y_2)$ if it was not already implied
by the output formula. Note that at this point the constraint does not capture
the exact semantics of the automaton.

The universally quantified formula that is captured by this particular path
p is $\forall Y.\ \varphi_p \Rightarrow f_A(q_p)$. We construct these
implications for all simple paths in the EQDA and conjoin them to get the final
formula. All other paths in A semantically go to false. Hence, we also add a
conjunct $\forall Y.\ \neg(\bigvee_p \varphi_p) \Rightarrow false$. So for an
EQDA A,
$T(A) = \bigwedge_p \forall Y.\ \varphi_p \Rightarrow f_A(q_p) \;\wedge\; \forall Y.\ \neg(\bigvee_p \varphi_p) \Rightarrow false$.
Since negation is arbitrarily allowed over atomic formulas in Strand, T(A) is
clearly in the decidable fragment of Strand. APF also allows negation over
atomic formulas which relate two pointer variables or a universal variable with
a pointer variable. However, negation of an atomic formula $y_1 \le y_2$ is not
allowed in APF. But since we assume for the translation that the automaton
considers a fixed variable ordering on Y and all other paths with a different
ordering lead to true, we can simply remove negations of formulas
$y_1 \le y_2$ from $\neg\bigvee_p \varphi_p$.

Appendix D 

In this appendix, we sketch how an active learning algorithm can be used to
learn program invariants expressible in the array property fragment and the
Strand decidable fragment over lists. Invariant synthesis can be achieved using
two distinct procedures: (a) building the learner according to the learning
algorithm described in Section 4 and (b) building a teacher which can answer
questions about the invariant for a particular program. An acceptable invariant
for a program, in general, has to satisfy three properties: it must include the
pre-condition, it must be contained in the post-condition, and it must be
inductive. Moreover, in order to certify that the above indeed hold, the
invariant should be expressible in a logic that permits a decidable
satisfiability problem for the above conditions.



Building an adequate teacher is not easy as the invariant is unknown, and
the whole point of learning is to find the invariant. Still, the teacher
certainly has some knowledge about the set of structures in the invariant and
can answer certain questions with certainty. For example, when asked whether a
data word w belongs to the invariant I, if w belongs to the pre-condition (or
the strongest post-condition of the pre-condition), she can definitely say that
w belongs to I. Also, when w belongs to the negation of the post-condition (or
to the weakest pre-condition of the negated post-condition), the teacher can
definitely answer that w does not belong to I. For other queries, in general,
the teacher gives arbitrary answers and these answers determine the kind of
invariant that is finally learned. Turning to equivalence queries, if the
learned invariant falls within a decidable fragment (as is ensured by the above
learning algorithm) and the pre/post-condition and the program body are such
that the verification conditions are expressed in appropriate decidable logics
(Strand/APF), a teacher is able to check if the conjectured invariant is
adequate and satisfies the above three conditions. If the invariant is
inadequate and does not include the pre-condition or intersects the negation of
the post-condition, then the teacher can find an appropriate counterexample to
report to the learner. If the inadequacy is due to the conjecture not being
inductive, then the teacher would find a pair of configurations (c, d) such that
c is allowed by the conjecture while d is reachable from c and is excluded from
it, and decide to report either c or d as a counterexample. This choice again
determines the final invariant being learned, similar to membership queries that
the teacher is unsure about.

The idea is to pit such a teacher against a learner in order to learn the in- 
variant, despite the fact that the teacher does not know the invariant herself. 
The learner's objective is to learn the simplest data automaton that captures 
the knowledge the teacher has. The key property that the learner relies on is Oc- 
cam's razor — that the simplest set (i.e., the automaton with the least number of 
states) consistent with the queries answered by the teacher is a likely invariant. 
Note that the learner will not, in general, simply learn an automaton that cap- 
tures precisely the knowledge of the inadequate teacher; representation of this 
knowledge is often far more complex than a true invariant. In other words, the 
learner will learn the simplest automaton that generalizes the partial knowledge 
the teacher has. 
[Figure: EQDA diagram; its output formulas combine the atoms
$d(y_1) \le d(y_2)$, $d(y_1) < k$ and $d(y_2) < k$.]

Fig. 4: The EQDA expressing the invariant of the program which finds a key k in
a sorted list. Here head and cur are pointer variables and k is an integer
variable in the program.